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Abstract 

Given a metric space (X, dx), the earth mover dis- 
tance between two distributions over X is defined as 
the minimum cost of a bipartite matching between 
the two distributions. The doubling dimension of a 
metric (X, dx) is the smallest value a such that every 
ball in X can be covered by 2 a ball of half the radius. 
A metric (or a sequence of metrics) is called dou- 
bling precisely if its doubling dimension is bounded. 
We study the efficient algorithms for approximating 
earth mover distance over doubling precisely metrics. 

Our first result is a near linear time (in the size 
of the X) algorithm for estimating EMD over dou- 
bling metric X, with a 0(otx) approximation ratio, 
where ax is the doubling dimension of X . Given a 
metric {X, dx), we can use 0(n . ) preprocessing time 
to create a data structure of size 0(n 1+e ), such that 
subsequent EMD queries can be answered in 0(n) 
time, with approximation ratio 0(otx/e)- 

Our second result is an encoding scheme, which is a 
weaker form of sketching. In an encoding scheme, dis- 
tributions are encoded, such that the EMD between 
two distributions can be estimated in sub linear time, 
given the encodings of the two distributions. In par- 
ticular, given (X,dx), by using 0{n 2 ) preprocessing 
time, every subsequent distribution fx can be encoded 
into F(fjb) in 0(n 1+e ) time. The query for EMD be- 
tween and v can be answered in 0(n e ) time, with 
approximation ratio 0(ax/e), given the two encod- 
ings F(n) and F{u). 

The encoding scheme has immediate applications. 
In a 2-player game where 1 player knows fi and the 
other knows v, there is a communication protocol 
with small communication complexity, through which 
the two players can approximate the EMD between 



\i and v. Another application is distance oracle, 
where we are given a metric (X, dx ) and s distribu- 
tions [ii,[i2,' • • j Ms over X, we can use 0(n 2 + sn 1+£ ) 
preprocessing time, creating a data structure of size 
0(n 1+e + sn), such the query for EMD between \xi 
and fij can be answered in 0{n e ) time, with approx- 
imation ratio 0(ax/e)- 

1 Introduction 

Given a finite metric (X, dx) and two multi-sets A, B 
of points in X with \A\ = \B\ = N, the Earth-Mover 
distance(or EMD) between A and B is defined as the 
minimum cost of a perfect matching, with respect to 
the cost function dx, i.e: 



EMD X (A,B) 



mm > 

tt:A^>B ^ 



d x (a,n(a)\\ 



where 7r ranges over all one-to-one mappings. 

The EMD metric is of significant importance in 
many applications. For example, in computer vision, 
EMD is used as a measurement for dissimilarity be- 
tween two images. The idea is to represent an image 
as a distribution on features (such as colors and color 
spectrum), with an underlying metric. Then, the 
EMD of two distributions of features can be used for 
navigating image databases and image retrieve ([2D], 

m, m, m, m, bsd- 

The exact distance can be computed in 0(N 3 ) 
time for general underlying metrics, using Hun- 
garian method(flSJ). For metric supported by a 
sparse graph, there is a scaling algorithm due to[T2] 
that runs in 0(y / jV r [|£ , | log(|V|iV)) time. Better 
algorithms are known for special metrics. EMD 
over a 2-dimensional plane can be computed in 
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0(N 5 ' 2 log° (1) N) time, due to Vaidya[23J. This time 
is further improved to 0(N 2+S ), for any S > 0, by 
Agarwal etc. [3J. 

Even for the special metric such as 2-dimensional 
plane, the exact EMD requires super-quadratic 
time, too expensive in many applications. This 
computational bottleneck motivates faster ap- 
proximation algorithms. For the EMD over 
2-dimensional plane, Agarwal and Varadarajan 
[21] showed 0(N 3 / 2 /e 2 log 5 (JV/e)) time (1 + e)- 
approximation, then they gave an improved al- 
gorithm with N 1+s log 0( - 1] N-time and log(l/<5)- 
approximation( 2 ). Indyk [14] further improved the 
running time to 0(N log *- 1 -* N) for constant approx- 
imation. Following [T3], Andoni etc. [4 provided a 
sketching algorithm with constant approximation. 

There has also been special interest for embedding 
EMD into normed spaces. Charikar [8] showed that 
EMD over (X,dx) can be embedded into l\ space 
with distortion a, if (X,dx) can be embedded into 
distribution of dominating trees with distortion a. 
Then, by [TT], which showed a is at most 0(\ogn) 
for a n-point metric, EMD over (X, dx) with can be 
embedded into l\ with distortion O(logn). Embed- 
ding into l\ can give approximation algorithms, with 
the ratio equal to the distortion of the embedding. 
However, the embedding of EMD into l\ has limita- 
tions : it has been showed in [T5] that embedding of 
EMD over grid [n] 2 must incur a distortion of at least 
O(Vlogn). 

In this paper, we are interested in approximat- 
ing EMD over more general metrics: metrics with 
bounded doubling dimensions. The doubling dimen- 
sion of a metric (X, dx ) is the smallest a such that 
every ball in X can be covered by 2 Q balls of half 
the radius. This is a richer family of metrics than 
the constant-dimension Euclidean space. It can be 
shown, for example, for any fixed p, the doubling di- 
mension of d-dimensional l p is roughly d. On the 
other hand, [T3J showed a family of metrics (Gk,dk) 
with bounded doubling dimension, whose embedding 
into I2 must incur a distortion of fi(y4og|G/t|). For 
a finite metric space X, ax < log|X|. For many 
problems involving metrics, better results are known 
if the metrics have bounded doubling dimension. |16] 
obtained an 0(y/ax logn) distortion embedding of X 
into I2, an improvement over Bourgain's theorem([T]) 
if the metric has low doubling dimension. [3] give bet- 
ter approximation algorithms for metric labeling and 
0-extension for a-decomposable metrics(if a metric 
has doubling dimension a, it is O(a)-decomposable). 



1.1 Our contribution 

We study approximation algorithm for EMD over 
doubling metrics, a generalization of low dimensional 
Euclidean spaces. As far as we know, this is the first 
work to consider the EMD over this family of metrics. 
Our algorithm is a generalization of the algorithm in 
[14] for approximating EMD over planar grids. We 
defined a metric, called "sibling-linked hierarchical 
well-separated tree" metric, that the doubling metric 
can be embedded into with constant distortion. This 
metric has a nice property that the EMD over it is 
the sum of EMD over smaller metrics, which allows 
us to do "importance sampling" . In this paper, we 
also promoted a scheme called "encoding scheme" , a 
weaker form of the sketching scheme. In the encoding 
scheme, distributions are encoded in some form, and 
the EMD of two distributions can be approximated 
in sub-linear time, if the encodings of the two dis- 
tributions are given. The sub-linear time estimation 
algorithm does the importance sampling in a binary- 
search way. using only logarithmic time. 

Compared to the Euclidean spaces, there is an is- 
sue for doubling metrics on how the underlying met- 
ric is given. Reading the whole metric requires time 
quadratic in n, the size of the metric, while we're 
aiming for algorithms with time near linear in n. We 
avoid this bottleneck by preprocessing, in which our 
algorithms read the metric (X, dx) and create a data 
structure of small size. By doing this, subsequent 
queries for EMD between /1 and v can be answered 
in time near linear in n. 

Throughout, we use ax to denote the doubling di- 
mension of X. We assume each coordinate in a dis- 
tribution can be represented in polylog(n) bits. In 
particular, for a fixed constant number a, we define 

This restriction is only for simplicity of demonstra- 
tion and doesn't affect our algorithms. 

Our first result is an almost linear time, con- 
stant approximation algorithm for EMD over dou- 
bling metrics. 

Theorem 1.1 (Approximation algorithm). Let 

(X, dx) be a metric space with \X\ = n. Let < e < 1 
be fixed. Given (X,dx) and ax, there is an algorithm 
which, by using 0{n 2 ) preprocessing time to create 
a data structure of size 0(n 1+t ), for any subsequent 
query for EMD between two distributions fi, v G Vx, 
outputs in 0(n) time a random estimation D satisfy- 
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ing 

EMD x {ii,v)<D<0(^p)EMD x {iJL,v) (1) 

with probability at least 2/3. The probability is over 
the randomness for the preprocessing as well as the 
estimation. 

The preprocessing time is quadratic in the size of 
the metric, which is unavoidable since we have to 
read the whole metric. However, the algorithm only 
reads the metric once, after which subsequent EMD 
queries can be answered in almost linear time in n. 
The probability 2/3 can be amplified to 1 — e, by 
repeating the algorithm 0(log(l/e)) times. 

Unlike 0], our algorithm for theorem 11.11 is not 
sketching-based. Instead, we provide a weaker 
scheme, which we call "encoding scheme". In the 
encoding scheme, distributions are encoded to l\ vec- 
tors by an encoding function F . Given two encodings 
F(n) and F(y), there is a sub linear time algorithm 
approximating EMDx(n, v). The difference between 
an encoding and a sketch is, an encoding is not nec- 
essarily shorter than its pre-image. 

Theorem 1.2 (Encoding scheme). Let X,dx ,n,e be 
as in theorem \l.ll Given (X,dx) and ax, there is 
an algorithm, which uses 0(n 2 ) preprocessing time to 
create a data structure of size 0(n 1+e ), such that for 
any subsequent query for EMDx(p-,v), it can per- 
form the following 3 steps: 

1. computes an encoding F(fj,) of size 0(n) for n 
in 0(n 1+e ) time, only reading \i and the data 
structure; 

2. computes F(f) similarly; 

3. outputs a random number D satisfying ([!]) with 
probability 2/3, in 0{n e ) time, only reading 
F(fi), F(y) and the data structure. 

Notice that compared to the algorithm in theorem 
11.11 the algorithm in theorem 11.21 requires 0(n 1+e ) 
time to approximate the EMD. The encoding scheme 
implies the following two theorems. 

Theorem 1.3 (Communication protocol). Let 

X,dx,n,e be as in theorem Consider a game 

with two players Alice and Bob, in which Alice knows 
fi £ Vx o,nd Bob knows v £ Vx ■ There exists a com- 
munication protocol with complexity 0(n f ), through 
which Alice and Bob can output a number D satisfy- 
ing (Q]) with probability at least 2/3. 

Theorem 11.31 suggests that proving a good lower 
bound for sketching using general communication 
complexity is impossible. 



Theorem 1.4 (Distance oracle). Let X,dx,n,e be 
stated as in theorem \l.ll Given (X,dx), otx an d 
s distributions /ii,/i2, • • • , fj, s £ Vx, there is an algo- 
rithm which uses preprocessing time 0{n 2 + sn 1+e ) to 
construct a data structure of size 0(n 1+e + sn), and 
for any subsequent query for EMD between /i = fj,i 
and v = fij, outputs a number D in 0{n e ) time, sat- 
isfying flj) with probability at least 2/3. 

1.2 Preliminaries 

The doubling dimension ax of a metric {X, dx) is 
defined as the minimum t such that every ball in 
X can be covered using 2* balls of half the radius. 
Define Rx = minp inX maxq £ Xdx(p, q) to be 
the radius of X. When X is clear from the con- 
text, we may omit the subscript, using a, R instead. 
For a point p £ X and a real number r, we use 
Ball(p,r) = {q : dx(p,q) < r} to denote the ball of 
radius r centered at p. 

For two distributions \i,v £ Vx, the earth mover 
distance (EMD) between fi and v is defined as 

EMD x (fi,v) = min } d x (p,q)Tr(p,q) 

jr:XxX->l *■ — ' 

where it ranges over all functions 7r satisfying 
Vp,q£X,n(p,q) > 0; 

v p e x < ^2 7r (??;<?) = 

pex 

We'll use l\ to denote the set of L\ vectors of finite 
dimension, i.e l\ = (J£i ^ ■> equipped with L\ norm. 
We'll use © to denote the direct product operation. 
So, x\ © X2 is the direct product of x\ and x%, and 
(Bi=i x s is the direct product x±, X2, ■ ■ • x s . 

The Cauchy distribution C(a;o:7) is the distri- 
bution with the following probability density func- 
tion : f(x) = i ( (x _ X0 7 )^ + ^ ) , where 7 is called 
the scale parameter of the distribution. The cu- 
mulative distribution function of C(xq,j) is F(x) = 

1 arctan ^ a ~ a:Q ^ + \. Cauchy distribution is 1-stablc 
distribution, meaning that the sum of n independent 
variables from C(0, 1) is C(0, n). 

Throughout, we'll use 0() notion to hide a factor 
of polylog(n). 

2 Overview of the algorithms 

Our algorithms for theorem 11.11 and theorem 11.21 are 
only slightly different. We'll describe the major com- 
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poncnts of the algorithms here, and give the details 
in following sections. 

In the preprocessing, we read (X, dx) and decom- 
pose it into a distribution of so called sibling-linked 
hierarchical well separated trees, or SLHSTs, with dis- 
tortion 0(a/e), where e is some parameter. The name 
is a little confusing since SLHST is actually not a tree 
(recall that even grids can not be embedded into dis- 
tribution of trees with distortion o(logn)). A SLHST 
is constructed by adding links connecting children of 
the same vertex (or siblings) to a base HST. In a HST, 
there is a unique path connecting two leaves. While 
in SLHST, we take a shortcut for the this path: the 
two children of the last common ancestor of the two 
leaves are directly connected. A perfect analogy for 
this metric is the postal service system. In the postal 
service system with a hierarchy of post offices, a pack- 
age is sent from a terminal to a local post office, and 
then to one with higher level in the hierarchy, and 
then to a even higher level, until it can be sent to 
an office of the same level. Then the package is sent 
to lower levels along the hierarchy system, until it 
reaches the destination. 

Our embedding into distribution of SLHSTs is 
dominating, and has small distortion. Then, by a 
similar argument as the one in [5], the EMD over 
X can be approximated by the EMD over a random 
SLHST from the distribution. A nice property that a 
SLHST has is, the EMD over it can be represented as 
a sum of 0(n) EMDs over smaller sub metrics. Recall 
a special case in [JJ], the EMD over a grid of size n 
is the sum of EMDs over grids of smaller size. 

The decomposition into smaller EMDs allows us to 
use the "importance sampling" technique, which is 
also used in [T3] . Suppose we want to compute a sum 
Z = J2i=i Z%i but it's too expensive to compute all 
the ZjS. We can choose a term Zi with probability 
roughly proportional to how much it contributes to 
the sum. Then Zi divided by this probability is used 
as the estimation for Z. The expectation of the value 
is Z, while the deviation depends on how well the 
sampling distribution approximates the weight distri- 
bution. Applying importance sampling to our case, 
we need a technique to approximate each EMD over 
a small metric up to a reasonable factor. This can be 
done using the embedding of metrics to distribution 
of dominating trees and then computing the EMD 
over the trees ([IT]. [5]). 

The above techniques provided us the main ele- 
ments needed to prove theorem 1 while not enough 
to prove theorem 11.31 To get a sub-linear time es- 
timation algorithm, we need to do the binary sam- 
pling more carefully. We invented a "binary impor- 



tance sampling" process, which allows us to do the 
importance sampling in logarithmic time. The high 
level idea is to sample in a binary-search way. Re- 
call in an importance sampling, we want to compute 
Z = ^2*—i Zi. We maintain a possible set S of terms, 
initially [N] . Each time the set S is divided into two 
equal subsets Si and S2 , and then replaced with one 
of two subsets, with probability proportional to the 
weight of the subset. In logiV steps, we end up with 
a single term in S, which is the term selected by the 
algorithm. This method requires us to estimate the 
sum of a subset of terms. In our case, each term 
can be approximated by the L\ norm of a vector, so 
the total weight of a subset is the sum of Li norms, 
which is equivalent to one Li norm. We can use the 
sketch scheme in [17] to approximate the Li norm. 
For every potentially possible set S, we have a sketch 
for the concatenation of vectors in S. The sketch is 
linear, a crucial property allowing us to encode two 
distributions separately. We'll describe the encoding 
function and the binary sampling algorithm in section 

El 

3 Sibling-linked hierarchical 
well-separated trees 

We introduce the definition of fc-HST from [6] and 
then based on this definition, we define a /c-SLHST. 

Definition 3.1 (fc-Hierarchical well-separated 
tree([B])). A k -hierarchical well-separated tree(k- 
HST) is defined as a rooted weighted tree with the 
following properties: 

1. The edge weight from any node to each of its 
children is the same; 

2. The edge weights along any path from the root to 
a leaf are decreasing by a factor of at least k. 

We call k the scale-decreasing factor of the HST. 
For a tree fc-HST r, we use V T , U T , deg{r) and dep(r) 
to denote r's vertices, non-leaf vertices, degree and 
depth, respectively. For some v G U T , let A(v) be 
the set of u's children, T(v) be the set of v's offspring 
leaves and be the length of edges from v to its 
children. 

Definition 3.2 (^sibling-linked hierarchical 
well-separated tree). Let t be a k-HST. We as- 
sociate each v e U T a metric d v : A„ x A, 1 
satisfying 

Vu, w G A(u), d v (u, w) < 2A V (2) 
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For some v G U T and u,w G A(i>), a sibling link 
(u,w) is an edge of length d v (u,w) connecting u and 
w. We call the graph obtained by combining t and all 
the sibling links a /c-sibling linked hierarchical well- 
separated tree(k-SLHST). 

For a fc-SLHST T, we use dx to denote its shortest 
path metric. The notions such as dep(r), deg(r), A. v , 
A v and F„ are naturally extended to fc-SLHSTs. We 
say a SLHST T supports X if X is the leaves of T. 

To avoid confusion, for a SLHST T that supports 
X , we always use p, q to denote points in X , as well 
as the leaves of T, and u, v to denote inner vertices 
in a SLHST, if possible. 

3.1 Embedding X into a distribution 
of SLHSTs 

Now, we are going to describe our algorithm for em- 
bedding metric [X, dx) to a distribution of dominat- 
ing SLHSTs. The algorithm is almost the same as 
the HST embedding algorithm in [TT], except that 
we use n 1 / 6 as the scale-decreasing factor, instead of 
2. The embedding is pretty bad (it will incur a dis- 
tortion of 0(n e )). As we'll show, after we insert the 
sibling-links to the HST, the distortion becomes very 
small. We partition a set of radius R to clusters of 
radius r in log(i?/r) steps and in each step we reduce 
the radius by a factor of 2. 

Despite of its similarity with the tree embedding 
algorithm of , we give our algorithm here, for the 
integrality of the paper. Algorithm [1] partitions a 
metric Y in to sets of smaller radius, and it will be 
used in by algorithm [2J 



Algorithm 1 partition(Y, dy , r) 
Input: A metric (Y,dy) and a scale r, r is guaran- 
teed to be at least i?y/2; 

Output: A partition of Y into clusters 
of radius smaller than r : {Yi : i G [s]} 

l: Select a r/2-net S C Y; 

2: Randomly choose a permutation 7r for S and a 

number /3 G [1,2) 
3: for i = 1,2,- •• do 
4: Y z <- Bally (7r(<), Pr/2) \ \JrJ x Y r , 
5: end for 

6: return {Y t : i e [\S\],Yi ^ 0} 



Claim 3.3. The returned {Yi :i£s} from algorithm 
[7] is indeed a partition, i. e, Vi ^ j, Yi f~)Yj — and 



Claim 3.4. The number of clusters that algorithm]]] 

returns is at most (4R Y /r) aY < 8 aY , for r > Ry/2. 

Algorithm[2]describes how a SLHST is constructed. 
The tree is essentially a laminar tree of the X, where 
each vertex in the tree is correspondent to a sub- 
set of X. A vertex vy at level i (the root has level 
0) has A VY = r"j, and is associated with a metric 
d VY (vY',VY") = dx{cy, Cy") +4rj+i, where cy> and 
cyti are the centers chosen for Y' and Y" respec- 
tively. The additive term 4ri + i is to make sure that 
the SLHST metric is dominating. 

Algorithm 2 embedding JntoSLHST(X,dx,e) 
Input: A metric (X, dx) and < e < 1/3 such that 
l/e is an integer; 

Output: A SLHST 

T\ 

i: h<- \ogR x ; 

2: C^{X},C ^(b,V T = {vx},E T ^Q); 
3: for i <— 1, • • • , h do 
4: for Every Y G C do 

5: C" <- partition(Y,dx\Y,2 h ~ i ); 
6: for Every Y' G C" do 

7: V T ^V T + {Y'},E T ^E T + {(V Y ,VY>)} 

8: end for 

9: C'^C' + C"; 
10: end for 
ll: e<-C',C'^-0; 
12: end for 

13: a 4— [elogn/axl, & random integer from [a]; 

14: L = {i : < i < /i, i = 6(mod a)} + {0, h}; 

15: Shrink r at level set L: we remove all level-z ver- 
tices for all i L, and directly connect level- j 
vertices to their level-i ancestors using edges of 
length 2 ,l ~ l , for i < j and i,j adjacent numbers 
in L; let r' be the new tree. 

16: Let i < j be two adjacent numbers in L, for a 
level-i vertex vy and two level- j vertices vy> , Vy" 
whose parent in t' is vy(so, Y',Y" C Y), d VY — 
dx(cy ,cy"), where cy> is the center of ^'(i.e, 
the radius of Y' is the maximum distance from 
cy to some other point in Y'). 

17: return the SLHST T defined by r' and d VY s. 



To avoid confusion, we use "rank" instead of "level" 
to denote the positions of vertices in r, r' and T. We 
say a vertex v G v T has "rank" i, if the path from v to 
the root in t contains i edges. The rank of a vertex in 
T(or t') is just the rank of its correspondent vertex 
in t. Notice that if some vertex v G Ut = U T > has 
rank i, then i € L. 

Lemma 3.5. The algorithm fj| actually returns a 
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SLHST, i.e, the associated metrics satisfy equation 
HP. Furthermore, deg(T) < . 

Proof. Let vy € be a rank i vertex and vy , Vy>> 
be two of its chilren in T. Notice that Vyi , vy" has the 
same rank j, where i and j are adjacent in L. Then 
dvy{vY.,VY") = dx(c Y >,c Y ») < 2R Y < 2 h - l+1 = 
2A VY . 

The degree of r is 2 0{ - a \ by claim E31 The 
shrinking operation collapses a levels into 1 level, 
and thus, the degree becomes at most 2°^ a = 

20 (a) \t log n/a~] _ n O(e.) _ |-| 

Lemma 3.6. The random SLHST T that algorithm 
[H returns supports X and satisfies: 

1. Vp,q e X,d T (p,q) > d x (p,q); 

2. Vp,qeX,E T [d T (p,q)} <0(a/e)d x {p,q). 

Proof. Fix two points p and q. Let vy, vy> be the two 
highest-rank vertices in the shortest path between p 
and q in T, and v be their parent. Let the rank of vy 
and vy be i. If i = h, clearly drip, (?) = dx{p,q)', if 
i < h, 

dr(p,q) > d v (v Y ,v Y >) 

= d X (CY,Cy) + 2 h - i+1 

> d x (p,q) ~ d x (p,c Y ) - d x (q,c Y >) + 2 h - ,+1 

> d x (p, q) - 2 h -' 1 - 2 h - 1 + 2 h - l+1 
= d x (p,q); 

So, we've proved the first property. 

Let Pi be the probability that p and q are separated 
at rank i in r. Obviously, 2i=i Pi = 1j an d horn 
we have 

ft 

5^2^ < 0(logn)dx(p,g) 

Now, we fix a tree r. Suppose p and g are separated 
at rank i in r. Let uy. and oy< be the rank j ancestors 
of p and q, respectively. Hi < h— a + 1, p and q may 
be separated at rank i, i+ 1, • • • , i + a — 1 in r', each 
with probability 1/a. The expected drij>,q) 1 over all 
possible 6s, denoted by Di, is at most 

i+a—i I h—i \ 

i=* \ j v =i / 

«+a— 1 



< ~ ^ (dx(p,g) + 2' l ^' + 2 h -J+4x 2 h - 
12. 



If p and q are separated at rank i in r, for some 
i > /i — a + 1 , £>i is at most 

1 / * \ 

- 5Z rfx ( Cy 3 j C17) + (* + - 1 _ (p, q) 

h I h-1 \ 

< d x (p, g) + - ^ 2^ + 2^ + 2 2^" 



12, 



< d x (p,q) + -2 h - 1 
a 

W.[dr(p, q)] Taking the expectation over r and b, we 



get 



E[d T {p,q)] = 5Z PlA 

i=l 

i=l ^ a J 



12 



< 53p 1 ^( J3 , (? ) + _53p2' 1 

z — ' a ' — ' 



i=i 



12 



a 



h — i 



< d x (p,q)-\ 0{\ogn)d x (p,q) 

a 

< 0(a/e)d x {p 1 q) 

The last inequality comes from the fact that a — 
[elogn/a]. □ 

3.2 Decomposition of EMD over a 
SLHST 

Before showing the main lemma of this subsection, 
we introduce some notions. 

Let T be a SLHST that supports X. For a dis- 
tribution /i G Vx and a vertex v <E Ut, define 
t L v = E pe i»Mp and fi v = ueA( „) M«. Define ex- 
tended EMD (or EEMD) between jl v and t>„ to be 

EEMDJp, v ,v v ) = min EEMD^(u v ,u v ) 

7r:A„ x A„-yR 

EEMD^(fi v ,P v ) = 52 K(u,w)d v {u,w) 

+ a„ 52 - 5Z 7r "' ,i 
+ a„ 52 ^ ~ X! 7^ "• , ' 

where 7r ranges over all transportation functions sat- 
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isfying 



Thus, 



Vu, w £ A„, 7r(u, u>) > 0; 
\fu £ A„, ^ 7r(u,u;) < fi u ; 



tuEA„ 



Vw € A„, 7r(tt,ii;) < !/„ 



«eA„ 



The definition says that all the unmatched units 
must be sent to v. 

Lemma 3.7. 

EMD T (pL,v)= EEMD v (fi v ,0 v ) 

v£U T 

Proof sketch. We can view distributions /x and v as 
supplies and demands on the leaves of the tree. The 
allowed operation is moving e amount of supplies (or 
demands) along some edge (u, v), which costs el(u, v). 
We can cancel out the same amount of supplies and 
demands on the same vertex for free. We can further 
restrict the moving direction, so that the moves can 
not go downwards. 

We show that the best strategy is to match sup- 
plies and demands locally, i.e, it can not move sup- 
plies along (u, v) while moving demands along (w, v), 
for some v £ Ut and u,w £ A„. Otherwise we 
would moving some amount of supplies directly from 
u to w and then match the same amount of de- 
mands at w. This will make the cost smaller, since 
d v (u,w) < A v + A v . Thus, in the best matching, 
each vertex v will receive max(/j„ — v v , 0) amount of 
supplies and max(V„ — fjt, v , 0) amount of demands and 
cancels out all but | [i v — v v | amount of supplies or 
demands, which will be sent to its parent. This is 
exactly the sum of EEMD v s over all v £ Ut- □ 

Lemma 3.8. If X can be embedded into a set y of 
dominating metrics, with distortion (3, then, EMDx 
can be embedded into {EMDy '■ Y £ y}, with distor- 
tion f3. 

Proof. It's easy to see that EMDy dominates 
EMD X . Let tt : X x X -> R be the best trans- 
portation function for EM Dx (p,v), and ip : Y xY ^ 
be the best transportation function for EMDy^i, v). 
By the definition of EMD, we have 

EMD x {p.,v) = Y Tr(p,q)d x {p,q) 
p,qex 

EMD Y {n,v) = Y ipY(p,q)d Y (p,q) 
p,qex 



E EMDyU.v) 
Yey 



E Y i ) Y{p,q)dy{p,q) 



Yey 



p,qex 



< y ^ y Tr(p,q)dy(p,q) 



p,qex 



= Y n (p>i) Y ^ y d Y(p,q) 

p,q&x 

< Y 7r (p>q)l 3d x(p,q) 

p,q&x 

< /3EMD x (n,v) 



□ 



Lemma 3.9. If T be the distribution of dominating 
SLHSTs that O (a x / e)- approximate X, then 



EMD x {n,v) < ij2 EEMD Av {fi v ,v v ) 

< 0(ax/e)EMD x (»,v) 



Proof. This lemma immediately follows lemma 13.71 
and lemma E31 □ 



4 Importance sampling, proof 
of theorem 11.11 

After we decomposed EMD X into J2veu T EEMD V , 
we'll use importance sampling to choose a element 
v £ Ut, as mentioned in section O The probabil- 
ity that v is selected should be roughly the ratio of 
EEMD V and EMD T . The following lemma from 
[14] tells us how many samples are needed. 

Lemma 4.1 ([14 ). Let Z x , ■ ■ ■ , Z s > 0, Z = £\ Z t 

and qi = Zi/Z. Let p±,p2, ■ ■ • ,Ps be some numbers 
satisfying pi > qt/j and ■ , pj = 1, for some 7 > 
1. Consider a random variable S such that Pr[S = 
Zi/Pi] = Pi', note that E[S] = Z . Then, 



Pr 



E Si-Z 

ie[M] 



> 0.5Z 



< e -n(M/ T ) 



where Si, S2, • • • , Sm ore independent copies of S. 

Proof. With probability 1, 

Zi Zqi 
< S < max — = max < 7Z 

ie[s] pi ie[s] p l 



By Chernoff bound, we have 



Pr 



E Si-Z 
ie[M] 



> 0.5Z 



< e -n(M/ T )_ 
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□ 



To apply lemma 14.11 we need a way to approxi- 
mate each EEMD V within a reasonable ratio. Em- 
bedding the metric d v into distribution of trees can 
give an estimation that is always at least EEMD V , 
and at most a logarithmic factor times EEMD V in 
expectation(see [5] and [H]). However, to allow a 
good estimation, the number of trees in the sup- 
port should be J7(|A„|). Then, we need roughly 
J2veu T \^"\ = 0(nmax„ |A„|) running time for the 
importance sampling, as opposed to 0(n) stated in 
the theorem. We can do slightly better by using the 
min of the EMDs, instead of the average. 

Lemma 4.2. Let (Y,dy) be a metric space such that 
\Y\ < n, and T be a distribution of dominating trees 
that approximate (Y,dy) up to O(logn) factor, i.e, 
E re r d T (p,q) < 0(logn)dy(p, g). If we randomly 
choose s = O(logn) trees t\,T2,-" ,t s from T, then 
for every pair of distributions /i, v over Y , 

min EMD T . (/i, v) < 0(\ogn)EMD Y {^,v) 
ie[s] 

with probability at least 1 — 1/n. 

Proof. For 1 tree r, with probability 1/2, 
EMD T (n,v) < 0(logn)EMD Y (p,v)- So, with 
probability at least 1 — 1/n, there exists a i £ [s] 
satisfying EMD n (/x, v) < O (log n)EMD Y (fJ,, v). □ 

Now we can proceed to prove theorem 11.11 

Proof sketch of theorem ] 1. 11 In the preprocessing, 
we run algorithm [5] to construct a SLHST T, with 
e to be decided shortly. For each metric (A v ,d v ) 
associated with an inner vertex v € Ut, we choose 
s = 0(log n) dominating trees t v> i,t v ,2, ■ • ■ , t v , s from 
the distribution of dominating trees that O(logn)- 
approximates d v . The data structure includes T, all 
T v .iS and all sub-metrics of X restricted to A„ for 
v € U T - 

The time to sample a SLHST is 0(n 2 ) and the 
time to compute all r v> i is still 0(n 2 ), implying the 
preprocessing time is 0(n 2 ). 

The size of the data structure is 0(J2v£U T \^""\ 2 ) = 
ri 1 +o(e)_ \A V \ 2 space is for small metrics d v s and 
sub-metrics of dx restricted to A^s. 

Let fi,v € Vx be the two distributions in the query. 

With probability at least 0.9, the following two 
things happen: 



2. For every v € Ut, min ie [ s ] EEMD Ti v (fl v ,i> v ) < 
0(\ogn)EEMD v (fi Vl u v ); 



1. EMD T 



Suec; EEMD V approximates 



EMDx with approximation ratio 0(a/e) 



by lemma 13.91 and lemma 14.21 

Assuming the above two things happen, 
we can do the importance sampling, using 
minj e M EEMD Ti v (£l v , P v ) as the weight for v. 
The probability that v is selected in the impor- 

tance sampling is at least EEMD ^) 
1 6 0{\ogn)EMD T 
by lemma 14.11 we only need O(logn) samples to 
approximate EMDt{^, v) up to a constant factor. 
This total time for importance sampling is 0(n), 
since £„ e(7T |A„| = 6{n). 

Let v be a vertex selected by the importance sam- 
pling. We compute EEMD v (fj, v ,i) v ) by using stan- 
dard Hungarian algorithm. |A„| is at most n°( e \ for 
small enough e, computing EEMD V takes o(n) time. 

In all, with preprocessing time 0(n 2 ), data 
structure of size 0(n 1+e ), we can approximate 
EMDx{n,v) up to a 0(ax/e) factor in 0(n) time. 
Notice that we can remove the 0(.) notion in the 
exponent, by losing a constant factor in the approxi- 
mation ratio. □ 



5 Binary importance sampling, 
proofs of theorem [L2], [TT3] and 

era 

In this section, we describe an encoding scheme, 
where each distribution [i is encoded into a linear 
code F{p) and EMDx{^,v) can be approximated 
using sub-linear time, when F(fi) and F(v) are given. 

Recall that, in the algorithm for theorem 11.11 we 
need to do the importance sampling, where we esti- 
mate EEMD V for every v £ T. To design a sub- 
linear estimation time algorithm, we must avoid esti- 
mating EEMD V for every v. 

This can be done by using a binary sampling 
method where, we maintain a set S of possible out- 
puts, initially Ut and during each iteration, set S is 
divided into 2 equal subsets, and replaced with one of 
the subset with probability proportional to the weight 
of the subset. The process repeats 0(log n) times, un- 
til there's only 1 element left in S, which is the output 
of the importance sampling. 

The above method requires a good estimation for 
the total weight of a subset. In section SI we used 
the min of 1% norms as the estimation for the weight 
of an element. Then the total weight of a subset 
is the sum of mins, which seems hard to estimate. 
We'll use a different estimation : the sum of l\ norms, 
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which is still a l\ norm. The total weight of a sub- 
set is again the sum of l\ norms, equivalent to a 
li norm. There are good sketching schemes for l\ 
metric, for example, |17] used a random linear map 
g : li — > R k as the sketching of a l\ vector, such that 
the L\ norm of x is approximated by the median of 
\gi{x) \ , \gi{x)\ , • • • , \gk(x)\, where g t is the i-th coor- 
dinate of g. In the sketch function, each gi is a linear 
function of x, where the linear coefficients are chosen 
from the Cauchy distribution of scale parameter 1. i.e 
9i = J2j a j x j' a j ~ ^(0) !)■ The value k determines 
how well the norm is approximated, as in lemma RT2"1 
Before giving the lemma, we first define: 

Definition 5.1 (p-good). For some < p < 0.1, we 
say the sketch function g is p-good for a fixed x, if 

(l-p)Mi < median(\g 1 (x)\,\g 2 (x)\,--- ,\g k (x)\) 

< (i + p)Wi 

where gi,gi,--- , fffe are k coordinates of g. 

Lemma 5.2. Let g : l\ — > R fc be the l\ sketch func- 
tion, with k = c/ p 2 for some < p < 0.1 and integer 
c. Then, for a fixed x G l\, g is p-good with probability 
at least 1 - e~ n( - c \ 



We leave the proof of lemma 15.21 to the appendix. 
For a set of N elements, we fix a binary sampling 
tree, where each node is a subset of [N]. A set is 
equal to the union of its two child sets of half the 
size. The root of the tree is [TV], and the N leaves 
are {1} , {2} , • • • , {N}. Let Sn be the family of all 
subsets of [N] in this binary sampling tree; notice 
that \S N \ = O(N), and J2ses N \ S \ = 0(N). Our 



encoding / : (li 



is defined as 



f(xi,x 2 ,--- ,x N )= (J) 5 © : 



(3) 



Algorithm [3] is the binary importance sampling al- 
gorithm. It takes the encoding / for N vectors in 
l±, and selects a number i G [N], with probability 
supposed to be \xi\ / J2 e[N] \ x i\- 

Claim 5.3. P t is the probability that algorithm^ se- 
lects t, as stated. 

Definition 5.4 (successful). We say a encoding f 
is successful for fixed x±,X2, ■ ■ ■ ,xn, if algorithm 



selects t with probability at least 0.5 \xt\ 1 / © 



Lemma 5.5. Let x±, x%, ■■ ■ ,xn be fixed. If all occur- 
rences of g in the encoding f are l/(10logN)-good, 
f is successful. 



Algorithm 3 bjimport_sample(f(xi 1 X2, 



,x N )) 



Input: Encoding / for N vectors x\, Xi, ■ • • , x^; 
Output: A pair (t,P t ), where t £ [N] and 
P t is the probability that the algorithm selects 



U 



S^[N],P^1; 
while \S\ > 1 do 

Let Si and S2 be the two children of S in the 

binary sampling tree; 



<J 



(©. 



Extract g 1 = g (® jeSl x j) and 9 2 
from the encoding /; 



W\ median (| g\ | , | g\ | , • • ■ , | g\ | ) and W2 
median (\g 2 \ , \g$ \ ,\gl\); 
Let i be 1 with probability W ^} W2 and 2 with 
probability w ™ 2 m ; 

S Si, P <— P x Wl + W2 ; 
end while 

return (t, P) where t is the unique element in 
S; 



Proof. If all occurrences of g in f are 1/(10 log N)- 
good, the median of absolute values of the sketch will 
always give the l\ norm of the pre-image, up to a 
factor of 1 ± 1/(10 log A). Thus the probability that 
algorithm [3] selects t is at least 



1 - 



log N 



10 1 



Ft 



1 



10 log N 



®i = l x i 



> 0.5 



x t 



0i=l X i 



where the exponent log N comes from the depth of 
the binary sampling tree. By definition 15.41 / is suc- 
cessful. □ 

Then we apply algorithm [3] to compute the EMD 
over a SLHST metric. As shown in lemma |3~T1 EMD 
over a SLHST T is the sum of EEMD v s over all 
v G Ut ■ Each underlying metric d v can be embedded 
into distribution of dominating trees, with distortion 
logarithmic in the size of the metric. In order to 
make the preprocessing efficient, we need a small set 
of dominating trees. [7j gives exactly what we want : 

Theorem 5.6 ([7). Given a metric (X,dx) with 
\X\ = n, there is a set of 0(n log n) dominating trees 
and a probability on them, such that the metric dx is 
approximated by the set of dominating tree metrics, 
with distortion O(lognloglogn). Moreover, there is 
a polynomial algorithm which gives the set and the 
probability. 



Then, by lemma EOH (or jHJ), the EEMD over 
(v,d v ) can be embedded into the average EEMD 
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of 0(|A„| log |A„|) trees, which is equivalent to the 
norm of a (|A„| 2 log |A„|) dimensional Li, with dis- 
tortion 0(log|A„|loglog|A„|). We use h v : R A ^ -> 
rO(|A(ii)|-) ^ q denote the linear function whose l\ 
norm approximates EEMD V . 

For a SLHST T, let's list all the vertices in U T ■ 
v%, V2, ■ ■ ■ ,vn- Define : 

Ft(v) = f(h Vl (p, Vl ),--- ,h VN (fi VN ))® fi Vi (4) 

ie[N] 

Then, algorithm 0] shows how to approximate the 
EMD in sub linear time, if the encodings are given. 

Algorithm 4 approx-EM D(T, F T (fi), F T (v)) 
l: Sum <— 0; 

2: Extract / M = / (h Vl (p, Vl ) , • • • ,h VN (fi VN )) from 
F T (n), and f v = f (h Vl (u Vl ), ■ ■ ■ , h v N {vv N )) from 
F T (u); 

3: for i 1,2, • • • , [log 2 n] do 

4: (t,P) = bjimport_sample(f fi — /„); 

5: Extract fi Vt from -Ft(m) and v Vt from Ft{v)\ 

6: Sum <— Sum + EEMD Vt (fi Vi .,P Vt )/P; 

7: end for 

8: return Sum/ [log 2 n] . 



Lemma 5.7. J/ aiZ occurrences of f in Ft{h — 
v) = Fr(fj) — Ft(v) are successful, algorithm 
[^] outputs a number between Q.bEMDT{pi.,v) and 
1.5EM Dt^, v) with probability at least 0.9. 

Proof. If all occurrence of / in Ft (/i — v) are success- 
ful, algorithm [3] chooses t with probability at least 

0.5h t (fL t ,fi t ) 

> 0.5EEMDtUk,i>t) 

0(lognloglogn)X) ie [jv] EEMD^fi^Vi) 

1 EEMD t (ji u v t ) 
O(logTiloglogn) EMD T (fj,,v) 

By lemma 14.11 the probability that the algo- 
rithm outputs a number between 0.5EM Dt(h, v) 
and 1.5EMDt(h, v) is at least 

X _ g-^(log 2 n/ log n log log n) > Q g 

□ 

Proof sketch of iheorem U ."A HOI and \Tl\ We first 
prove theorem II .21 and then show how this it implies 
theorem 11.31 and 11.41 

The preprocessing stage is almost the same as the 
algorithm for theorem ll.il except that we choose s = 



0(|A„|].og|A„|) dominating trees t Vi i,t v , 2 , ••• ,r„ )S 
for v using the algorithm in [TJ. The preprocessing 
time is 0(n 2 ), and the size of the data structure is 
0(n 1+£ ). 

We can suppose we have EMDt approximates 
EMDx with 0(ax/e) distortion, because this hap- 
pens with high probability. 

Then we compute the encoding Fr{n) and Ft{v) 
using formulas §5§ and (U]). The time to compute 
/ is k times the dimension of the input, which is 
E v eu T 0(\K\ 2 ) = 0(n 1+e ) in F T (p). The second 
part of Ft (a*) can be ignored. The time to compute 
the encoding is 0(kn 1+e ). The size of the encoding 
is O(Nk) = 6(nk). 

If we set k to 0(log 3 n), then by lemma 15721 ev- 
ery occurrence of g is (9(l/logn)-good with prob- 
ability 1 - 0(l/n 4 ). Totally, there are 6{\U T \) = 
0(n) occurrences of g in Ft {ft — v), thus, with 
probability 1 — 0(l/n 3 ), every occurrence of g 
is 0(l/logn)-good, which implies, by lemma 15751 
f(h Vl (fi Vl ), - ■ ■ ,h VN (fj VN )) is successful. This finally 
implies algorithm will output a number between 
Q.5EMDt((J>, v) and l.bEMD T {^v) with probabil- 
ity at least 0.9, by lemma 15771 The number 0(«x/e)- 
approximates EMDx(p, v). 

Since k = (9(log 3 n), the time need to compute 
an encoding is 0{kn 1+e ) = 0(n 1+e ), and the size of 
an encoding is 0(nk) = 0(n). The binary impor- 
tance sampling takes time O(l), The bottleneck of 
algorithm [4] is computing the EEMD v s, which takes 
total time h°^). In algorithm [H we don't really 
compute Ft{h~ v). As the encoding is linear, when- 
ever we want to read a number from Ft(p — v), we 
read the numbers in Ft(p-) and Ft(v) and then do 
the subtraction. 

Theorem 11.21 immediately implies theorem 11.31 
Alice computes F(jjl) and Bob computes F(v), then, 
they communicate to simulate algorithm^] The com- 
munication complexity is 0(n e ), with approximation 
ratio ax/e. 

For theorem 11.41 we can store the encodings for 
all the s distributions in the preprocessing stage and 
when a query comes, run algorithm [~4j We can run 
the preprocessing stage O(logs) times independently 
and store O(logs) data structures so that with high 
probability, the EMD between every pair is reserved. 

□ 

6 Conclusions 

We have demonstrated a almost linear algorithm for 
estimating EMD over doubling metrics up to a con- 
stant factor. We also derived an encoding-based sub- 
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linear time constant approximation algorithm, which 
may have further applications. 

An interesting direction to purse would be to find 
a sketching scheme for EMD over doubling metrics. 
[4]'s sketching scheme benefits from the fact that all 
the small grids are the same. While in our case, the 
small metrics are different. If we can embed the EMD 
over the small metrics to any uniform normed space 
up to a constant factor, we can use the technique in 
[4] to derive a sketch scheme. 
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For some < 6 < 0.1, define Yi = 1 if gi < 
F~ 1 (l/2 - 8) or gi > F- 1 {l/2 + 6) and otherwise. 
Using Chernoff bound, we have 



Pr 



fc 

E 

i=l 



Yi > k/2 



< e 



-5 2 k/8 



If fc = c/S 2 , the above probability is at most e n ( c \ 
For small enough 6, F- 1 (l/2-5) = (1-0(6))-/ and 
F~ 1 (l/2 + 8) = (1 + 0(6))j. Yi < n/2 implies 

the absolute median falls between (1 — 0(6)) 7 and 
(1 + 0(5))-/. □ 



A Proof of lemma 15.21 

Proof. For a fixed x and i £ [fc], the distribution of 
gi(x) is C(0, \x\i). Let 7 = |x|i, and 



/(*) 



if t < 

77 t 2J r~{ 2 — 



be the probability density function \gAs. F is the 
correspondent cumulative function : 

f if t < 

F(<) = | larctan(^) if t > 
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